Front-end challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of homographs, numbers, and abbreviations that all require expansion into a phonetic representation.

There are many words in English which are pronounced differently based on context. Some examples:

  • project: My latest project is to learn how to better project my voice.
  • bow: The girl with the bow in her hair was told to bow deeply when greeting her superiors.

Most TTS systems do not generate semantic representations of their input texts, as processes for doing so are unreliable, poorly understood, and computationally expensive. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, such as examining neighboring words and using statistics about frequency of occurrence.
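
A minimal sketch of such a heuristic, in Python, might keep a small table of cue words for each known homograph and check a token's immediate neighbors. The cue table and ARPABET-style pronunciations below are invented for illustration, not drawn from any real system:

    HOMOGRAPHS = {
        # word: (default pronunciation, {cue word seen next to it: alternate})
        "project": ("P R AA1 JH EH2 K T",           # noun: PROJ-ect
                    {"to": "P R AH0 JH EH1 K T"}),  # verb after "to": pro-JECT
        "bow": ("B OW1",                            # the bow in her hair
                {"deeply": "B AW1"}),               # bow deeply
    }

    def disambiguate(words):
        """Choose a pronunciation for each homograph by inspecting neighbors."""
        result = []
        for i, word in enumerate(words):
            key = word.lower()
            if key not in HOMOGRAPHS:
                result.append(key)
                continue
            default, cues = HOMOGRAPHS[key]
            neighbors = {words[j].lower() for j in (i - 1, i + 1)
                         if 0 <= j < len(words)}
            result.append(next((pron for cue, pron in cues.items()
                                if cue in neighbors), default))
        return result

    print(disambiguate("to project my voice".split()))
    # ['to', 'P R AH0 JH EH1 K T', 'my', 'voice']

Real systems use far richer statistical models than a hand-written cue table, but the structure is the same: the decision is made from local context rather than from a semantic analysis of the sentence.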

Deciding how to convert numbers is another problem TTS systems have to address. It is a fairly simple programming challenge to convert a number into words, like 1325 becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts in texts, and 1325 should probably be read as "thirteen twenty-five" when part of an address (1325 Main St.) and as "one three two five" if it is the last four digits of a social security number. Often a TTS system can infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the systems provide a way to specify the type of context if it is ambiguous.
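
As a rough sketch of the expansion step itself, assuming the context (cardinal number, address, or digit string) has already been inferred or specified, the conversion might look like the following; the function and context names are hypothetical:

    ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
            "eighty", "ninety"]

    def two_digit(s):
        n = int(s)
        if n < 20:
            return ONES[n]
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

    def cardinal(n):
        # Handles 0..9999, which is enough for the examples in the text.
        parts = []
        if n >= 1000:
            parts.append(ONES[n // 1000] + " thousand")
            n %= 1000
        if n >= 100:
            parts.append(ONES[n // 100] + " hundred")
            n %= 100
        if n:
            parts.append(two_digit(str(n)))
        return " ".join(parts) or "zero"

    def expand_number(digits, context="cardinal"):
        if context == "digits":                        # e.g. an ID number
            return " ".join(ONES[int(d)] for d in digits)
        if context == "address" and len(digits) == 4:  # e.g. "1325 Main St."
            return two_digit(digits[:2]) + " " + two_digit(digits[2:])
        return cardinal(int(digits))

    print(expand_number("1325"))             # one thousand three hundred twenty-five
    print(expand_number("1325", "address"))  # thirteen twenty-five
    print(expand_number("1325", "digits"))   # one three two five

The hard part in practice is not the word generation shown here but the context classification that selects among the three readings.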

Similarly, abbreviations like "etc." are easily rendered as "et cetera", but abbreviations are often ambiguous. Consider "in." in "Yesterday it rained 3 in. Take 1 out, then put 3 in.", or "St." in "St. John St." TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others apply the same expansion in every case, resulting in nonsensical but sometimes comical outputs: "Yesterday it rained three in." or "Take one out, then put three inches."
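
A toy expander for the examples above might use regular expressions keyed on surrounding context. The rules below are simplistic and invented for illustration; note that the digit-before-"in." rule handles the first sentence correctly but mis-expands the second, which is exactly the ambiguity at issue:

    import re

    def expand_abbreviations(text):
        # "in." preceded by a digit is read as "inches" -- right for the first
        # sentence below, wrong for the second, where "in" is a preposition.
        text = re.sub(r"(\d+)\s*in\.", r"\1 inches.", text)
        # "St." before a capitalized word becomes "Saint"; otherwise "Street".
        text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
        text = re.sub(r"\bSt\.", "Street", text)
        return text

    print(expand_abbreviations("Yesterday it rained 3 in. Take 1 out, then put 3 in."))
    # Yesterday it rained 3 inches. Take 1 out, then put 3 inches.
    print(expand_abbreviations("St. John St."))
    # Saint John Street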

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word from its spelling, a process often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term linguists use for the distinctive sounds of a language).

The simplest approach to text-to-phoneme conversion is the dictionary-based approach, in which the program stores a large dictionary containing the words of a language and their correct pronunciations. Determining the correct pronunciation of each word is then a matter of looking it up in the dictionary and replacing the spelling with the pronunciation specified there.
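
A minimal sketch of dictionary lookup, using a toy lexicon with ARPABET-style transcriptions as a stand-in for a real pronunciation dictionary such as CMUdict:

    LEXICON = {
        "hello": "HH AH0 L OW1",
        "world": "W ER1 L D",
        "speech": "S P IY1 CH",
    }

    def lookup(word):
        """Return the stored pronunciation, or None for out-of-vocabulary words."""
        return LEXICON.get(word.lower())

    print(lookup("Hello"))   # HH AH0 L OW1
    print(lookup("blorp"))   # None -- the approach fails outside its dictionary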

The other approach to text-to-phoneme conversion is rule-based: pronunciation rules are applied to words to work out their pronunciations from their spellings. This is similar to the "sounding out" approach to learning to read.
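
A toy rule-based converter might try spelling-to-phoneme rules longest-first at each position in the word. The handful of rules below are invented for illustration; real rule sets number in the hundreds and are context-sensitive:

    RULES = [  # (spelling, phoneme); digraphs first so they match before single letters
        ("ch", "CH"), ("sh", "SH"), ("th", "TH"), ("ee", "IY"),
        ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
        ("b", "B"), ("c", "K"), ("d", "D"), ("f", "F"), ("g", "G"),
        ("h", "HH"), ("k", "K"), ("l", "L"), ("m", "M"), ("n", "N"),
        ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"),
    ]

    def sound_out(word):
        word, phones, i = word.lower(), [], 0
        while i < len(word):
            for spelling, phone in RULES:
                if word.startswith(spelling, i):
                    phones.append(phone)
                    i += len(spelling)
                    break
            else:
                i += 1  # no rule matched this letter: skip it
        return " ".join(phones)

    print(sound_out("cheese"))  # CH IY S EH -- the silent final "e" shows the limits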

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but it fails completely on any word that is not in its dictionary, and as the dictionary grows, so do the memory requirements of the synthesis system. The rule-based approach, on the other hand, works on any input, but the complexity of the rules grows substantially as they account for irregular spellings and pronunciations. As a result, nearly all speech synthesis systems use a combination of both approaches.
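
Combining the two sketches above into the hybrid design described here takes one line: consult the dictionary first, and fall back to the rules for out-of-vocabulary words (lookup and sound_out refer to the earlier sketches):

    def pronounce(word):
        """Dictionary first, letter-to-sound rules as the fallback."""
        return lookup(word) or sound_out(word)

    print(pronounce("hello"))  # from the dictionary: HH AH0 L OW1
    print(pronounce("blorp"))  # sounded out by rule: B L AA R P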

Some languages, like Spanish, have a very regular writing system, and predicting the pronunciation of words from their spelling succeeds in nearly all instances. Speech synthesis systems for such languages often use the rule-based approach as their core text-to-phoneme method, resorting to dictionaries only for the few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. By contrast, systems for languages with extremely irregular spelling, like English, rely mostly on dictionaries and use rule-based approaches only for unusual words or names that are not in the dictionary.